fix(sync): prevent concurrent-session rollbacks from destroying writes on in-memory SQLite by phernandez · Pull Request #989 · basicmachines-co/basic-memory

phernandez · 2026-06-12T13:20:09Z

Failure 1: test_sync_entity_circular_relations — lost relation row (the important one)

Root cause

The in-memory SQLite URL (sqlite+aiosqlite://) falls back to SQLAlchemy's StaticPool, which hands the same DBAPI connection to every concurrently checked-out session. All concurrent asyncio tasks therefore share one SQLite transaction scope:

a COMMIT by one session commits every sibling task's half-done statements;
a ROLLBACK by one session silently destroys every sibling task's executed-but-uncommitted statements.

Rollbacks are routine, not exceptional: the pool's reset-on-return issues a real DBAPI ROLLBACK at every connection checkin. A probe counting DBAPI calls during one sync of the two-file circular-relations scenario measured 40 commits and 41 rollbacks on the shared connection. scoped_session's exception handler is a second rollback source (any per-file error swallowed by _run_bounded).

During _index_changed_files → BatchIndexer.index_files, per-file upserts run concurrently (asyncio.gather bounded by index_entity_max_concurrent). The losing interleaving:

task A's session finishes: DBAPI COMMIT (separate aiosqlite roundtrip);
task B's add_all_ignore_duplicates executes INSERT INTO relation (B→A, in the gap between A's commit and A's checkin — both are independent await points);
task A's connection checkin fires reset-on-return ROLLBACK → B's INSERT is erased;
task B "commits" an empty transaction. No error anywhere; sync reports success.

The final sync-level resolve_relations() pass cannot catch this: "pending" means a relation row with to_id IS NULL, and the destroyed INSERT leaves no row at all. Hence len(entity_b.outgoing_relations) == 0. #938 exposed the race by removing the multi-second per-entity embedding work that previously kept the upsert tasks from interleaving tightly. The window is microseconds, which is why it fired once on a loaded 2-core runner and never locally (30 pytest runs + 500 in-process iterations clean).

Production is not affected: file-based SQLite uses per-session pooled connections (isolated transactions, WAL) and Postgres uses NullPool. DatabaseType.MEMORY is only used by the test environment via engine_session_factory — but the unsafe engine configuration is product code in db.py.

Fix

_create_sqlite_engine now gives MEMORY engines a single-connection blocking pool (AsyncAdaptedQueuePool, pool_size=1, max_overflow=0). The lone connection keeps the in-memory database alive for the engine's lifetime (the reason StaticPool was chosen), while the blocking checkout serializes sessions at transaction granularity — restoring the isolation the repositories assume. No sleeps, no retries; nested-session misuse now fails loudly (pool checkout timeout) instead of silently sharing transactions, and one such test (test_delete_entity_with_relations) was un-nested.

Evidence

New regression test tests/db/test_memory_db_session_isolation.py fails deterministically on StaticPool (relation count 0 after a sibling session rolls back) and passes with the serialized pool.
Full unit suite green after the change (2946 passed); 200 in-process iterations of the circular-relations sync scenario green.

Failure 2: test_mcp_sse_forces_local — KeyError: 'FORCE_LOCAL' (Python 3.14 leg)

Root cause

The stdio variants of TestMcpCommandRouting invoke bm mcp, whose command body starts a real background auto-update daemon thread before mcp_server.run (the tests only mock run). That thread hits PyPI (5s timeout) and then rewrites config.json (auto_update_last_checked_at) with an in-place truncating Path.write_text. When that write lands while a later test's CLI invocation reads config.json (app callback → CliContainer.create() → load_config()), the reader sees empty/partial JSON and load_config() raises SystemExit. CliRunner.invoke swallows it, the mocked mcp_server.run never executes, and the test fails with KeyError: 'FORCE_LOCAL' — exactly the CI shape. Nothing is 3.14-specific; the leg only shifted timing (the test also never inspected result, discarding the evidence).

The torn write is also a real production race: the MCP stdio server re-reads config.json on mtime change and exits the process on invalid JSON if it races a CLI save.

Fix

save_basic_memory_config writes a per-process/per-thread sibling temp file and publishes via os.replace — readers always see a complete document. Regression test fault-injects an interrupted write; fails against the old in-place write.
The stdio routing tests stub run_auto_update (no PyPI call / config write leaking across tests), and all four transport tests assert result.exit_code == 0 so future pre-run failures surface the real error.

Verification

uv run pytest tests — 2946 passed, 33 skipped
uv run pytest test-int/cli — 69 passed
Both regression tests proven to fail with their fix stashed and pass with it restored
uv run ruff check . (changed files clean; pre-existing hermes findings untouched), uv run ruff format ., uv run ty check src tests test-int — clean

🤖 Generated with Claude Code

Fixes #940

…s cannot destroy writes Root cause of the test_sync_entity_circular_relations CI failure (#940, len(entity_b.outgoing_relations) == 0): the in-memory SQLite URL (sqlite+aiosqlite://) falls back to SQLAlchemy's StaticPool, which hands the same DBAPI connection to every concurrently checked-out session. Concurrent asyncio tasks therefore share one SQLite transaction scope. A rollback issued through one session — scoped_session's exception handler, or the pool's reset-on-return at connection checkin (~40 real ROLLBACKs per sync, measured) — also rolls back any other task's executed-but-uncommitted statements. During batch indexing, per-file tasks run concurrently (asyncio.gather bounded by index_entity_max_concurrent). If a sibling session's checkin ROLLBACK lands between one task's relation INSERT and its COMMIT, the relation row is silently erased: no error is raised, the sync reports success. The final sync-level resolve_relations() pass cannot heal this because it only re-queries rows with to_id IS NULL — the destroyed INSERT leaves no row at all. Fix: give MEMORY-type engines a single-connection AsyncAdaptedQueuePool (pool_size=1, max_overflow=0). The lone connection keeps the in-memory database alive for the engine's lifetime (the reason StaticPool was used), while the blocking checkout serializes sessions at transaction granularity, restoring the isolation the repositories assume. File-based SQLite and Postgres engines are unchanged; production never uses MEMORY engines. The regression test pins the invariant directly: a session that rolls back in one task must never destroy another task's uncommitted writes. It fails deterministically against StaticPool and passes with the serialized pool. tests/repository/test_entity_repository.py held a scoped session open while calling a repository method that opens its own session — tolerated on a shared connection, a deadlock under a serialized pool — so the nested call moved out of the session block. Refs #940 Co-Authored-By: Claude Fable 5 <noreply@anthropic.com> Signed-off-by: phernandez <paul@basicmachines.co>

… routing tests Root cause of the test_mcp_sse_forces_local KeyError('FORCE_LOCAL') flake (#940, Python 3.14 leg): the stdio variants of the mcp routing tests invoke `bm mcp`, which starts a real background auto-update daemon thread before mcp_server.run (the tests only mock run). That thread hits PyPI and then rewrites config.json (auto_update_last_checked_at) with an in-place Path.write_text, which truncates the file before writing. If that write lands while a later test's CLI invocation is reading config.json (the app callback's CliContainer.create()), load_config() sees empty/partial JSON and raises SystemExit. CliRunner.invoke swallows it, the mocked mcp_server.run never executes, and the test dies with KeyError on env_at_run['FORCE_LOCAL'] — exactly the observed CI failure shape. Nothing is 3.14-specific; that leg only shifted the timing. The same torn write is user-visible in production: the MCP stdio server re-reads config.json on mtime change and load_config() exits the process on invalid JSON if it races a CLI save. Fix: save_basic_memory_config writes a per-process/per-thread sibling temp file and publishes it with os.replace, so readers always observe either the old or the new complete document. The regression test injects an interrupted write and asserts the published config stays untouched; it fails against the old in-place write. Test hardening: the stdio routing tests stub run_auto_update so no PyPI call or config write leaks across tests, and all four transport tests now assert result.exit_code == 0 so a future pre-run failure surfaces its real error instead of a KeyError. Refs #940 Co-Authored-By: Claude Fable 5 <noreply@anthropic.com> Signed-off-by: phernandez <paul@basicmachines.co>

phernandez and others added 2 commits June 12, 2026 00:24

phernandez merged commit b3bdd59 into main Jun 12, 2026
23 checks passed

phernandez deleted the fix/940-batch-relation-resolution-race branch June 12, 2026 14:03

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

fix(sync): prevent concurrent-session rollbacks from destroying writes on in-memory SQLite#989

fix(sync): prevent concurrent-session rollbacks from destroying writes on in-memory SQLite#989
phernandez merged 2 commits into
mainfrom
fix/940-batch-relation-resolution-race

phernandez commented Jun 12, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

phernandez commented Jun 12, 2026

Failure 1: test_sync_entity_circular_relations — lost relation row (the important one)

Root cause

Fix

Evidence

Failure 2: test_mcp_sse_forces_local — KeyError: 'FORCE_LOCAL' (Python 3.14 leg)

Root cause

Fix

Verification

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant